Abstract: A vector-valued extension of the support vector
regression problem is presented here. The vector-valued variant
is developed by extending the notions of the estimator, loss
function and regularization functional from the scalar-valued
case. Particular emphasis is placed on the chosen class of loss
functions, which apply the ε-insensitive loss function to
the p-norm of the error. The primal and dual optimization
problems are derived and the KKT conditions are developed.
The general case for the p-norm is specialized for the 1-, 2-
and ∞-norms. It is shown that the vector-valued variant is a
true extension of the scalar-valued case. It is then shown that
the vector-valued approach results in sparser representations,
in terms of support vectors, than aggregated scalar-valued
learning.

I.  INTRODUCTION 

Multi-task learning is concerned with mappings of the
form $f : \mathbb{R}^n \mapsto \mathcal{Y}^m$ where $\mathcal{Y} = \{0, 1\}$ for classification and
$\mathcal{Y} = \mathbb{R}$ for regression. The problem of multi-task learning
is approachable as an aggregation of independent single-task
learning problems $f_j : \mathbb{R}^n \mapsto \mathcal{Y}$. Certainly, there is no loss
of generality with this approach; however, when the number
of tasks, $m$, is large, the aggregated approach has some
disadvantages. Regardless of the single-task method used,
the aggregated method requires $m$ optimizations, which for
the support vector machine (SVM) potentially requires $m$
sets of redundant kernel computations. Also, for SVM, it
is likely that there will only be coincidental commonality
between the support vectors associated with the different
tasks. The cost of the first disadvantage is incurred during
training, whereas the cost of the second is incurred during
use. Both of these costs may be negligible where kernel
computations are inexpensive; however, as kernel computations
become more expensive (e.g. large $n$) these costs
may become significant. Another motivation for multi-task
learning, according to Micchelli and Pontil [1], is mutual
dependence among the tasks, which the aggregated approach
ignores. Such dependence occurs when $\mathbf{y} = f(\mathbf{x})$ is an
embedding (which is certain when $n < m$). In such a
case, the outputs $\mathbf{y}$ lie on a subspace embedded in $\mathbb{R}^m$.
For regression such embeddings occur with the use of the
unit quaternions to represent rotations in 3-space, which are
locally parameterized by three orthogonal coordinates, but
embedded in $\mathbb{R}^4$. They also occur in the configuration space
of mechanisms with closed kinematic chains for which a
global parameterization is not available. These cases, as well
as others, motivate the development of multi-task learning
methods and in particular the multi-task SVM, for which the
regression problem will be addressed here. Such regression
problems are described as vector-valued in the sequel.

Recent  work  on  vector-valued  support  vector  regression 
(VV-SVR) is as follows. Vazquez and Walter [2] use a
separately trained Matérn class kernel for each task and then
use the single-task SVM for training. Micchelli and Pontil
[1], [3] give a theoretical treatment of reproducing kernel
Hilbert spaces (RKHS) in the range-space of the estimator.
Their  result  is  an  extension  of  the  traditional  scalar-valued 
kernel  function  to  an  operator-valued  kernel.  Ben-David 
and  Schuller  [4]  develop  conditions  under  which  learning 
multiple  tasks  is  provably  beneficial.  Evgeniou  and  Pontil  [5] 
consider  the  learning  of  an  average  task  simultaneously  with 
small  deviations  for  each  task.  Evgeniou,  Micchelli  and 
Pontil  [6]  extend  their  earlier  results  by  developing  indexed 
kernels  with  coupled  regularization  functionals. 

In contrast to this previous work, this paper emphasizes
the choice of loss function in the vector-valued regression
problem. Prior work on the loss function by Pérez-Cruz et
al. [7] used the squared Euclidean norm of the error with a
hyper-spherical insensitive zone. Also, Sánchez-Fernández
et al. [8] used a shifted squared Euclidean norm for a
differentiable loss function. These two approaches do not reduce to
the traditional SVR loss function in one-dimensional cases.
The VV-SVR proposed here generalizes the ε-insensitive loss
function of the scalar-valued case. It follows the traditional
scalar-valued SVM development [9]. The problem is first
set up by defining a regularized risk functional which
extends the scalar-valued case. This problem is then cast into
primal, Lagrangian and then dual forms. We then develop
the Karush-Kuhn-Tucker (KKT) conditions to relate the dual
variables to the primal variables, which are used to find the
bias. The general case is then specialized for the common
norms and specific approaches to determination of the bias
are developed. The method is demonstrated and shown to
be sparse in support vectors. The paper concludes with a
comparison of the vector-valued case to the scalar-valued
case and some observations.

II.  Scalar-Valued  Support  Vector  Regression

In the scalar-valued support vector regression (SV-SVR)
problem one seeks to model a causal relationship
$f : \mathbb{R}^n \mapsto \mathbb{R}$ between inputs $\mathbf{x}$ and an output $y$ from a finite
set of observations $\{(\mathbf{x}_i, y_i)\}_1^\ell$. Generally such a regression
problem takes the form of

$$\min_{\pi}:\; R_{reg} = \mathcal{P}(\pi) + C\sum_{i=1}^{\ell} L\big(y_i, \hat{y}(\mathbf{x}_i, \pi)\big) \tag{1}$$

in which we wish to minimize both the summed loss and
a regularization functional simultaneously. For the SV-SVR
problem the estimator $\hat{y}(\cdot,\cdot)$ is chosen as

$$\hat{y}(\mathbf{x}, \{\mathbf{w}, b\}) = \langle \mathbf{w}, \phi(\mathbf{x}) \rangle + b \tag{2}$$

where $\phi : \mathbb{R}^n \mapsto \mathbb{R}^N$ (or $\phi : \mathbb{R}^n \mapsto L^2$) is a nonlinear
mapping to a high-dimensional feature space and clearly
$\pi = \{\mathbf{w}, b\}$ where $\mathbf{w}$ is the weight vector and $b$ is the
bias. We desire that the SV-SVR perform well on our set
of observations, so we choose $\mathbf{w}$ and $b$ so that the summed
loss $L(y_i, \hat{y}_i)$ is minimized, where the loss function $L(\cdot,\cdot)$
is typically the ε-insensitive loss function $L(y, \hat{y}) = L(e) \triangleq
|e|_\varepsilon = \max(0, |e| - \varepsilon)$ where $e = y - \hat{y}$. Since $N \gg \ell$,
the minimization of the summed loss alone is ill-posed
and therefore traditional SVR introduces the regularization
functional $\mathcal{P}(\mathbf{w}) = \frac{1}{2}\langle \mathbf{w}, \mathbf{w} \rangle$ to stabilize the solution. With
these choices, the SV-SVR optimization problem in (1)
becomes


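$$\min_{\{\mathbf{w},\, b\}}:\; R_{reg} = \frac{1}{2}\langle \mathbf{w}, \mathbf{w} \rangle + C\sum_{i=1}^{\ell} \big| y_i - \langle \mathbf{w}, \phi(\mathbf{x}_i) \rangle - b \big|_\varepsilon \tag{3}$$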


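Now because this problem is cast in a large space of parameters
$\{\mathbf{w}, b\}$ and because it is non-smooth, it is transformed
into the dual problem

$$\max_{\{\beta_i\}_1^\ell}:\; D = -\frac{1}{2}\sum_{i,j=1}^{\ell} \beta_i \beta_j k(\mathbf{x}_i, \mathbf{x}_j) + \sum_{i=1}^{\ell} \beta_i y_i - \varepsilon \sum_{i=1}^{\ell} |\beta_i| \tag{4}$$

$$\text{S.T.:}\quad \sum_{i=1}^{\ell} \beta_i = 0, \qquad |\beta_i| \le C$$

where $k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$ is the kernel function.
Given the fact that $\mathbf{w} = \sum_{i=1}^{\ell} \beta_i \phi(\mathbf{x}_i)$, the final form of the
estimator then becomes

$$\hat{y}(\mathbf{x}) = \sum_{i \in I_{sv}} \beta_i k(\mathbf{x}_i, \mathbf{x}) + b \tag{5}$$

where $I_{sv} = \{i : \beta_i \ne 0\}$ denotes the set of indices of
the support vectors $\mathbf{x}_i$ and $b$ is determined using the KKT
conditions, which in the SV-SVR case may be briefly stated
as

$$\begin{aligned}
\beta_i = 0 &\;\Rightarrow\; |e_i| \le \varepsilon, \\
0 < |\beta_i| < C &\;\Rightarrow\; |e_i| = \varepsilon, \\
|\beta_i| = C &\;\Rightarrow\; |e_i| \ge \varepsilon, \\
\beta_i \ne 0 &\;\Rightarrow\; \beta_i e_i > 0.
\end{aligned}$$

This SV-SVR estimator has several desirable properties:
(1) it generalizes well, (2) it is based on linear math,
(3) the optimization problem is convex and (4) the dual
problem is quadratic. In particular, the SVR's generalization
ability is attributable to the sparsity and boundedness of
the solution of the dual problem (4). Sparsity is attributable to
the ε-insensitive zone (i.e. $-\varepsilon \le e_i \le \varepsilon$) of the loss
function, since the term $\varepsilon|\beta_i|$ creates a cusp in the
objective function (4), "trapping" solutions at $\beta_i = 0$. Both
the ε-insensitive zone and the cusp vanish if $\varepsilon = 0$. The
boundedness of each $\beta_i$ is attributable to the linear part of
the loss function (i.e. the bound on $\beta_i$ is related to the slope
of that linear part; see Smola and Schölkopf [10, pg. 13]). Since
both the sparsity and the boundedness of the estimator limit
the number of free parameters of the estimator, in a practical
sense the generalization properties of SV-SVR may be
attributable to the form of the loss function.

III.  Vector-Valued  Support  Vector  Regression

We will now extend the concepts just discussed for the
SV-SVR case to the vector-valued case. For each choice
made we will maintain the concepts from the scalar-valued
case to assure that the VV-SVR is a true generalization of
the SV-SVR.

A.  Problem  Setup

In the vector-valued case, the process to be estimated is
of the form $f : \mathbb{R}^n \mapsto \mathbb{R}^m$ which maps inputs $\mathbf{x} \in \mathbb{R}^n$ to
vector-valued outputs $\mathbf{y} \in \mathbb{R}^m$. From a finite set of observations
$\{(\mathbf{x}_i, \mathbf{y}_i)\}_1^\ell$, our goal is to find a function $\hat{\mathbf{y}}(\mathbf{x}, \pi)$ which will
be trained over its free parameters $\pi$ as

$$\min_{\pi}:\; R_{reg} = \mathcal{P}(\pi) + C\sum_{i=1}^{\ell} L\big(\mathbf{y}_i, \hat{\mathbf{y}}(\mathbf{x}_i, \pi)\big). \tag{6}$$

The structures of $\mathcal{P}(\cdot)$, $\hat{\mathbf{y}}(\cdot, \pi)$ and $L(\cdot,\cdot)$ are now chosen to
extend the same concepts from the SV-SVR case. For VV-SVR,
the family of functions $\hat{\mathbf{y}}(\cdot,\cdot)$ will take the linear form

$$\hat{\mathbf{y}}(\mathbf{x}; \{\mathbf{W}, \mathbf{b}\}) \triangleq \mathbf{W}\phi(\mathbf{x}) + \mathbf{b} \tag{7}$$

which generalizes (2), where the free parameters $\pi =
\{\mathbf{W}, \mathbf{b}\}$ consist of the weights $\mathbf{W} \in \mathbb{R}^{m \times N}$ ($N$ being the
dimension of the feature space) and the bias $\mathbf{b} \in \mathbb{R}^m$.
Similar to the SV-SVR case, a quadratic regularization
functional is chosen as $\mathcal{P}(\mathbf{W}) = \frac{1}{2}\operatorname{Tr}(\mathbf{W}\mathbf{W}^T)$. Finally, the
loss function must be extended such that $L : \mathbb{R}^m \times
\mathbb{R}^m \mapsto \mathbb{R}_+$. To construct this loss function in the spirit of
the SV-SVR, it is natural to maintain the concept of the
ε-insensitive loss function. Such an extension is obtained by
applying $|\cdot|_\varepsilon$ to a norm of the error $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ which yields

$$L(\mathbf{e}) \triangleq \big|\, \|\mathbf{e}\|_p \,\big|_\varepsilon \tag{8}$$

where $\mathbf{e} = [e_1 \cdots e_m]^T$ and $\|\mathbf{e}\|_p$ is the p-norm defined
by $\big(\sum_{j=1}^{m} |e_j|^p\big)^{1/p}$ for $1 \le p < \infty$ and $\max_j |e_j|$ for $p = \infty$.
This loss function has an insensitive zone for $\|\mathbf{e}\|_p \le \varepsilon$, is
a true generalization of $|\cdot|_\varepsilon$, and has a "linear" behavior for
$\|\mathbf{e}\|_p > \varepsilon$. Combining these choices for $\hat{\mathbf{y}}(\cdot,\cdot)$, $\pi$, $\mathcal{P}(\cdot)$
and $L(\cdot)$, (6) becomes

$$\min_{\{\mathbf{W},\, \mathbf{b}\}}:\; R_{reg} = \frac{1}{2}\operatorname{Tr}(\mathbf{W}\mathbf{W}^T) + C\sum_{i=1}^{\ell} \big|\, \|\mathbf{y}_i - \mathbf{W}\phi(\mathbf{x}_i) - \mathbf{b}\|_p \,\big|_\varepsilon \tag{9}$$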


which we note is a generalization of (3). In the next section
we will convert (9) into a suitable form for practical
optimization by developing the dual problem in the space of
Lagrange multipliers.
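The loss in (8) can be illustrated with a minimal Python sketch. The function names `p_norm` and `vv_eps_insensitive_loss` are hypothetical and chosen only for this illustration; the sketch assumes the errors are given as plain Python sequences.

```python
import math

def p_norm(e, p):
    """p-norm of a sequence e; p = math.inf selects the max norm."""
    if p == math.inf:
        return max(abs(v) for v in e)
    return sum(abs(v) ** p for v in e) ** (1.0 / p)

def vv_eps_insensitive_loss(y, y_hat, eps, p=2):
    """Loss (8): the eps-insensitive function applied to ||y - y_hat||_p.

    Hypothetical helper for illustration only.
    """
    e = [a - b for a, b in zip(y, y_hat)]
    return max(0.0, p_norm(e, p) - eps)

if __name__ == "__main__":
    # Inside the insensitive zone the loss is zero.
    print(vv_eps_insensitive_loss([1.0, 2.0], [1.02, 1.99], eps=0.1, p=2))
    # For m = 1 every p-norm is |e|, so the scalar loss is recovered.
    print(vv_eps_insensitive_loss([1.0], [0.7], eps=0.1, p=1))
```

For a single output the p-norm collapses to the absolute value regardless of p, which is the sense in which (8) generalizes the scalar ε-insensitive loss.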


B.  Dual  Problem  Development 

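Observe that (9) is non-smooth and may be infinite
dimensional; however, it may be simplified by deriving the
dual problem. First, the objective function may be smoothed
by introducing the slack variables $\xi_i$, $\delta_i$ and $\delta_i^*$ resulting in
the primal optimization problem

$$\begin{aligned}
\min_{\mathbf{W},\, \mathbf{b},\, \xi_i,\, \delta_i,\, \delta_i^*}:\;& \frac{1}{2}\operatorname{Tr}(\mathbf{W}\mathbf{W}^T) + C\sum_{i=1}^{\ell} \xi_i \\
\text{S.T.:}\;& \|\delta_i + \delta_i^*\|_p - \xi_i - \varepsilon \le 0, \quad \xi_i \ge 0, \\
& \mathbf{y}_i - \mathbf{W}\phi(\mathbf{x}_i) - \mathbf{b} - \delta_i \le 0, \\
& -\mathbf{y}_i + \mathbf{W}\phi(\mathbf{x}_i) + \mathbf{b} - \delta_i^* \le 0, \\
& \delta_i \ge 0, \quad \delta_i^* \ge 0, \quad \forall\, i = 1, \ldots, \ell
\end{aligned} \tag{10}$$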

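where the inequalities are taken element-wise and $\|\cdot\|_p$ is
defined without the absolute value since $\delta_i, \delta_i^* \ge 0$. Now
to reduce the dimension of the problem, we will cast the
primal problem into Lagrange form by introducing the
multipliers $\alpha_i$, $\eta_i$, $\gamma_i$, $\gamma_i^*$, $\theta_i$, $\theta_i^*$ as

$$\begin{aligned}
L ={}& \underbrace{\tfrac{1}{2}\operatorname{Tr}(\mathbf{W}\mathbf{W}^T)}_{\text{I}} + \underbrace{C\sum_{i=1}^{\ell} \xi_i}_{\text{II}} + \underbrace{\sum_{i=1}^{\ell} \alpha_i \big(\|\delta_i + \delta_i^*\|_p - \xi_i - \varepsilon\big)}_{\text{III}} - \underbrace{\sum_{i=1}^{\ell} \eta_i \xi_i}_{\text{IV}} \\
&+ \underbrace{\sum_{i=1}^{\ell} \gamma_i^T \big(\mathbf{y}_i - \mathbf{W}\phi(\mathbf{x}_i) - \mathbf{b} - \delta_i\big)}_{\text{V}} + \underbrace{\sum_{i=1}^{\ell} \gamma_i^{*T} \big(-\mathbf{y}_i + \mathbf{W}\phi(\mathbf{x}_i) + \mathbf{b} - \delta_i^*\big)}_{\text{VI}} \\
&- \underbrace{\sum_{i=1}^{\ell} \theta_i^T \delta_i}_{\text{VII}} - \underbrace{\sum_{i=1}^{\ell} \theta_i^{*T} \delta_i^*}_{\text{VIII}}
\end{aligned} \tag{11}$$

$$\text{S.T.:}\quad \alpha_i \ge 0, \quad \eta_i \ge 0, \quad \gamma_i^{(*)} \ge 0, \quad \theta_i^{(*)} \ge 0$$

which is to be minimized over $\mathbf{W}$, $\mathbf{b}$, $\xi_i$ and $\delta_i^{(*)}$ and to be
maximized over $\alpha_i$, $\eta_i$, $\gamma_i^{(*)}$ and $\theta_i^{(*)}$ for $i = 1, \ldots, \ell$. Now
since the primal variables are no longer constrained and all
constraints are imposed on the dual variables, we minimize
(11) with respect to the primal variables. Minimizing with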


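respect to $\mathbf{W}$, $\mathbf{b}$, $\delta_i^{(*)}$ and $\xi_i$ yields

$$\mathbf{W} = \sum_{i=1}^{\ell} \Gamma_i \phi^T(\mathbf{x}_i) \tag{12}$$

$$\sum_{i=1}^{\ell} \Gamma_i = 0 \tag{13}$$

$$\theta_i^{(*)} = \alpha_i \frac{\partial}{\partial \delta_i^{(*)}}\big(\|\delta_i + \delta_i^*\|_p\big) - \gamma_i^{(*)} \tag{14}$$

$$\eta_i = C - \alpha_i \tag{15}$$

respectively, where $\Gamma_i = \gamma_i - \gamma_i^*$. These relationships along
with the constraints $\eta_i \ge 0$ and $\theta_i^{(*)} \ge 0$ imply that

$$0 \le \alpha_i \le C \tag{16}$$

$$0 \le \gamma_i^{(*)} \le \alpha_i \frac{\partial}{\partial \delta_i^{(*)}}\big(\|\delta_i + \delta_i^*\|_p\big). \tag{17}$$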

Observe that the results expressed in (12) and (13) are similar
to those for the SV-SVR. By combining (12) with $\hat{\mathbf{y}}$ in (7)
the expression

$$\hat{\mathbf{y}}(\mathbf{x}) = \sum_{i=1}^{\ell} \Gamma_i \phi^T(\mathbf{x}_i)\phi(\mathbf{x}) + \mathbf{b} = \sum_{i=1}^{\ell} \Gamma_i k(\mathbf{x}_i, \mathbf{x}) + \mathbf{b}$$

of the VV-SVR estimator is obtained, which may be compared
to (5). Now the conditions expressed in (17) reflect a
complicated coupling between the primal and dual variables
which may be simplified with the following lemma.

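Lemma 1: Let $V = (\mathbb{R}^n, \|\cdot\|_p)$ define a normed vector
space and let $\mathbf{x} \in V$. Also, let $\|\cdot\|_q$ be the dual or conjugate
norm of $\|\cdot\|_p$ (that is $\frac{1}{p} + \frac{1}{q} = 1$). Then

$$\left\| \frac{\partial}{\partial \mathbf{x}} \|\mathbf{x}\|_p \right\|_q = 1$$

for $1 \le p \le \infty$.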

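Proof: We will examine three cases. We begin with the
smooth case and then address the non-smooth cases where
$p \in \{1, \infty\}$.

Case 1: Consider $1 < p < \infty$ for which $\|\cdot\|_p$ is
smooth everywhere except at 0. In this case for $\mathbf{x} =
[x_1 \cdots x_n]^T \in \mathbb{R}^n$ we have $\|\mathbf{x}\|_p = \big(\sum_{i=1}^{n} |x_i|^p\big)^{1/p}$,
then

$$\frac{\partial}{\partial \mathbf{x}} \|\mathbf{x}\|_p = \frac{1}{\|\mathbf{x}\|_p^{p-1}} \big[\operatorname{sign}(x_1)|x_1|^{p-1} \;\cdots\; \operatorname{sign}(x_n)|x_n|^{p-1}\big]^T.$$

Now since $q = \frac{p}{p-1}$, direct manipulation yields

$$\left\| \frac{\partial}{\partial \mathbf{x}} \|\mathbf{x}\|_p \right\|_q = \frac{\big(\sum_{i=1}^{n} |x_i|^{(p-1)q}\big)^{1/q}}{\|\mathbf{x}\|_p^{p-1}} = \frac{\|\mathbf{x}\|_p^{p/q}}{\|\mathbf{x}\|_p^{p-1}} = 1.$$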


In the following two cases the norms $\|\cdot\|_1$ and $\|\cdot\|_\infty$ are
non-smooth. However, it will be shown that each norm may be
modeled as a collection of smooth subspaces which cover V.
The gradient may then be calculated in each of these
subspaces to yield all of the possible directions of maximum
ascent at a non-smooth point. In this sense the gradient is
multi-valued at non-smooth points; however, it is shown that
the dual norm of each of these possible values is always 1.

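Case 2: Next, consider $p = 1$. In the smooth regions
$\frac{\partial}{\partial \mathbf{x}} \|\mathbf{x}\|_1 = [\pm 1 \;\cdots\; \pm 1]^T$ which clearly has an ∞-norm
of 1. Non-smooth regions exist for any $x_i = 0$ (called the
coordinate subspaces here). Let $\mathbf{x}_{ns}$ denote a candidate point
in one of these coordinate subspaces. We define the smooth
coordinate vector $\mathcal{C}(\mathbf{x}_{ns}) = [\mathcal{C}_1 \cdots \mathcal{C}_m]^T$ as

$$\mathcal{C}_i(\mathbf{x}_{ns}) = \begin{cases} 0, & \text{if } x_i = 0 \\ \operatorname{sign}(x_i), & \text{if } x_i \ne 0 \end{cases}$$

the zero coordinate vector as $\mathcal{Z}(\mathbf{x}_{ns}) = \mathbf{1} - |\mathcal{C}(\mathbf{x}_{ns})|$ and the
ternary permutation vector as $\mathcal{T}(\mathbf{x}_{ns}) = [\mathcal{T}_1 \cdots \mathcal{T}_m]^T$
where $\mathcal{T}_i \in \{-1, 0, 1\}$, $\forall\, i = 1, \ldots, m$. In the
neighborhood of a non-smooth point $\mathbf{x}_{ns}$, the norm may
be modeled as the linear functional given by $\|\mathbf{x}\|_1 =
\big(\mathcal{C}(\mathbf{x}_{ns}) + \mathcal{Z}(\mathbf{x}_{ns}) \circ \mathcal{T}(\mathbf{x}_{ns})\big)^T \mathbf{x}$ where $\circ$ denotes an
element-wise product. It is clear that all of the directions of
maximum ascent (defined by the permutations of $\mathcal{Z}(\cdot) \circ \mathcal{T}(\cdot)$)
from such a non-smooth point are given by

$$\frac{\partial}{\partial \mathbf{x}} \big(\|\mathbf{x}\|_1\big)\Big|_{\mathbf{x} = \mathbf{x}_{ns}} \in \big\{\mathcal{C}(\mathbf{x}_{ns}) + \mathcal{Z}(\mathbf{x}_{ns}) \circ \mathcal{T}(\mathbf{x}_{ns})\big\} \setminus \{0\},$$

all of which have an ∞-norm of 1, even for $\mathbf{x}_{ns} = 0$.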
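Case 3: Finally consider $p = \infty$. In smooth regions
$\frac{\partial}{\partial \mathbf{x}} \|\mathbf{x}\|_\infty = [0 \;\cdots\; \pm 1 \;\cdots\; 0]^T$ which clearly has a
1-norm of 1. Non-smooth regions exist for any $|x_i| =
|x_j| = \|\mathbf{x}\|_\infty$, $i \ne j$, which are called the maximal
equality subspaces here. Let $\mathbf{x}_{ns}$ denote a candidate point in
one of these maximal equality subspaces. Define the active
coordinate vector $\mathcal{A}(\mathbf{x}_{ns}) = [\mathcal{A}_1 \cdots \mathcal{A}_m]^T$ as follows

$$\mathcal{A}_i(\mathbf{x}) = \begin{cases} 0, & \text{if } |x_i| < \|\mathbf{x}\|_\infty \\ \operatorname{sign}(x_i), & \text{if } |x_i| = \|\mathbf{x}\|_\infty \end{cases}$$

and the binary permutation vector as $\mathcal{B}(\mathbf{x}_{ns}) =
[\mathcal{B}_1 \cdots \mathcal{B}_m]^T$ where $\mathcal{B}_i \in \{0, 1\}$, $\forall\, i = 1, \ldots, m$ are
the permutations of directions available for $\mathbf{x}$ to increase
along its maximal equality subspaces. In the neighborhood
of a non-smooth point $\mathbf{x}_{ns}$, the norm may be modeled as the
linear functional given by

$$\|\mathbf{x}\|_\infty = \frac{\big(\mathcal{A}(\mathbf{x}_{ns}) \circ \mathcal{B}(\mathbf{x}_{ns})\big)^T \mathbf{x}}{\|\mathcal{A}(\mathbf{x}_{ns}) \circ \mathcal{B}(\mathbf{x}_{ns})\|_1}.$$

Therefore, all of the possible directions of maximal ascent in
an immediate neighborhood of a maximal equality subspace
are then

$$\frac{\partial}{\partial \mathbf{x}} \big(\|\mathbf{x}\|_\infty\big)\Big|_{\mathbf{x} = \mathbf{x}_{ns}} \in \left\{ \frac{\mathcal{A}(\mathbf{x}_{ns}) \circ \mathcal{B}(\mathbf{x}_{ns})}{\|\mathcal{A}(\mathbf{x}_{ns}) \circ \mathcal{B}(\mathbf{x}_{ns})\|_1} \;:\; \mathcal{A}(\mathbf{x}_{ns}) \circ \mathcal{B}(\mathbf{x}_{ns}) \ne 0 \right\}.$$

Now it is clear that the 1-norm of each of these possible
gradients is equal to one. ■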
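The smooth case of Lemma 1 can be checked numerically. The following is a minimal sketch, not part of the proof; the helper names `grad_p_norm` and `dual_norm` are hypothetical, and only the range 1 < p < ∞ away from the origin is covered.

```python
import math

def grad_p_norm(x, p):
    """Gradient of ||x||_p for 1 < p < inf, evaluated at x != 0."""
    norm = sum(abs(v) ** p for v in x) ** (1.0 / p)
    return [math.copysign(abs(v) ** (p - 1), v) / norm ** (p - 1) for v in x]

def dual_norm(g, p):
    """q-norm of g with q = p / (p - 1), the conjugate exponent of p."""
    q = p / (p - 1.0)
    return sum(abs(v) ** q for v in g) ** (1.0 / q)

if __name__ == "__main__":
    x = [0.3, -1.2, 2.5]
    for p in (1.5, 2.0, 3.0, 7.0):
        # Each printed value should be 1 up to rounding error.
        print(p, dual_norm(grad_p_norm(x, p), p))
```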


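Now if we take the q-norm of (17) and subsequently apply
Lemma 1 along with (16) we obtain $\|\Gamma_i\|_q \le \alpha_i \le C$ from
which we conclude (without loss of generality) that

$$\alpha_i = \|\Gamma_i\|_q \le C \tag{18}$$

because it may be shown that $\alpha_i > \|\Gamma_i\|_q$ is always
sub-optimal. We may now substitute (18), (12), (13), (14) and
(15) into the Lagrangian problem (11) to obtain the dual
problem expressed in terms of $\{\Gamma_i\}_1^\ell$ as

$$\max_{\{\Gamma_i\}_1^\ell}:\; D = -\frac{1}{2}\sum_{i,j=1}^{\ell} \Gamma_i^T \Gamma_j k(\mathbf{x}_i, \mathbf{x}_j) + \sum_{i=1}^{\ell} \mathbf{y}_i^T \Gamma_i - \varepsilon \sum_{i=1}^{\ell} \|\Gamma_i\|_q \tag{19}$$

$$\text{S.T.:}\quad \sum_{i=1}^{\ell} \Gamma_i = 0, \qquad \|\Gamma_i\|_q \le C.$$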

Here  we  note  a  similar  structure  to  the  scalar-valued  case  as 
shown  in  (4)  and  upon  careful  examination  of  (19)  and  (4) 
we  observe  that  they  are  identical  when  m  =  1,  thus  (19) 
generalizes  the  SV-SVR  problem.  We  will  now  develop  the 
KKT  conditions. 
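To make the structure of the dual problem (19) concrete, the sketch below evaluates its objective and checks its constraints for a given set of multipliers. It does not solve the problem; the names `dual_objective` and `is_feasible`, and the representation of the multipliers as lists of lists with a precomputed kernel matrix `K`, are assumptions made only for this illustration.

```python
import math

def norm(v, q):
    """q-norm of a sequence; q = math.inf selects the max norm."""
    if q == math.inf:
        return max(abs(t) for t in v)
    return sum(abs(t) ** q for t in v) ** (1.0 / q)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dual_objective(Gamma, Y, K, eps, q):
    """Objective D of (19) for multipliers Gamma (l x m), targets Y (l x m)
    and kernel matrix K (l x l)."""
    l = len(Gamma)
    quad = sum(dot(Gamma[i], Gamma[j]) * K[i][j]
               for i in range(l) for j in range(l))
    lin = sum(dot(Y[i], Gamma[i]) for i in range(l))
    cusp = eps * sum(norm(g, q) for g in Gamma)  # non-smooth term
    return -0.5 * quad + lin - cusp

def is_feasible(Gamma, C, q, tol=1e-9):
    """Check sum_i Gamma_i = 0 and ||Gamma_i||_q <= C."""
    m = len(Gamma[0])
    sums_to_zero = all(abs(sum(g[j] for g in Gamma)) <= tol for j in range(m))
    bounded = all(norm(g, q) <= C + tol for g in Gamma)
    return sums_to_zero and bounded

if __name__ == "__main__":
    Gamma = [[1.0, 0.0], [-1.0, 0.0]]
    Y = [[2.0, 0.0], [0.0, 0.0]]
    K = [[1.0, 0.5], [0.5, 1.0]]
    print(dual_objective(Gamma, Y, K, eps=0.1, q=2), is_feasible(Gamma, 1.0, 2))
```

With m = 1 each `Gamma[i]` holds a single number, its q-norm is the absolute value, and the function evaluates the scalar-valued objective.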


C.  Karush-Kuhn-Tucker  (KKT)  Conditions 

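In the SVM literature the KKT conditions state that at
the optimum the product of each Lagrange multiplier and its
associated constraint must vanish. For our particular problem
the KKT conditions indicate that terms III through VIII of
(11) must each vanish at the optimum. Let the error be
defined as $\mathbf{e}_i = \mathbf{y}_i - \mathbf{W}\phi(\mathbf{x}_i) - \mathbf{b}$ and let the elements
of the vectors be as follows: $\gamma_i^{(*)} = [\gamma_{i,1}^{(*)} \cdots \gamma_{i,m}^{(*)}]^T$,
$\delta_i^{(*)} = [\delta_{i,1}^{(*)} \cdots \delta_{i,m}^{(*)}]^T$ and $\mathbf{e}_i = [e_{i,1} \cdots e_{i,m}]^T$.
We begin by stating without proof the rather obvious fact
that $\delta_{i,j}\delta_{i,j}^* = 0$ from (10). Likewise by (11.V) and (11.VI)
we have $\gamma_{i,j}\gamma_{i,j}^* = 0$. Also by the construction of (10), we
may choose without loss of generality that $\delta_i - \delta_i^* \equiv \mathbf{e}_i$.
Furthermore, when $\alpha_i = \|\Gamma_i\|_q \ne 0$, according to (11.V),
(11.VI), (11.VII) and (11.VIII) we have

$$\frac{\Gamma_i}{\|\Gamma_i\|_q} = \frac{\partial}{\partial \mathbf{e}_i}\big(\|\mathbf{e}_i\|_p\big) = \operatorname{sign}(\mathbf{e}_i) \circ \left(\frac{|\mathbf{e}_i|}{\|\mathbf{e}_i\|_p}\right)^{p-1} \tag{20}$$

which for $1 < p < \infty$ becomes $\frac{\mathbf{e}_i}{\|\mathbf{e}_i\|_p} = \operatorname{sign}(\Gamma_i) \circ \left(\frac{|\Gamma_i|}{\|\Gamma_i\|_q}\right)^{q-1}$.
This implies that there exists a directional relationship between
the error $\mathbf{e}_i$ and the Lagrange multiplier $\Gamma_i$ when $\|\Gamma_i\|_q \ne 0$.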

implies  that  there  exists  a  directional  relationship  between 
the  error  Oj  and  the  Lagrange  multiplier  T,  when  ||ri||^  0. 

In addition to these directional relationships, the magnitude
of the dual variable $\Gamma_i$ in the vector-valued case yields
information regarding the magnitude of the error $\mathbf{e}_i$. To explore this
relationship, consider the three cases of $\alpha_i = \|\Gamma_i\|_q$ with
respect to its constraints at 0 and C as shown in (16).

Vanishing $\|\Gamma_i\|_q$. First, consider the case where $\|\Gamma_i\|_q =
\alpha_i = 0$, which implies that $\|\mathbf{e}_i\|_p - \xi_i - \varepsilon \le 0$ by (11.III), and
since $\xi_i = 0$ by (11.IV) and (15), it follows that $\|\mathbf{e}_i\|_p \le
\varepsilon$ since $\|\mathbf{e}_i\|_p > \varepsilon$ would violate the constraint in (10).
Therefore, we conclude that $\|\Gamma_i\|_q = 0 \Rightarrow \|\mathbf{e}_i\|_p \le \varepsilon$.

Unconstrained $\|\Gamma_i\|_q$. Next, consider the case where
$\|\Gamma_i\|_q = \alpha_i \in (0, C)$. Since $\alpha_i \ne 0$ then $\|\mathbf{e}_i\|_p - \xi_i - \varepsilon = 0$
by (11.III). Also, since $\alpha_i \ne C$ it follows that $\eta_i \ne 0$ by (15)
which in turn implies that $\xi_i = 0$ from (11.IV). Hence, we
conclude that $\|\Gamma_i\|_q \in (0, C) \Rightarrow \|\mathbf{e}_i\|_p = \varepsilon$.

Bounded $\|\Gamma_i\|_q$. Finally, consider the case where $\|\Gamma_i\|_q =
\alpha_i = C$. This implies that $\eta_i = 0$ by (15); consequently
(11.IV) no longer forces $\xi_i$ to vanish, and due to the constraint
in (10), $\xi_i \ge 0$. Additionally, since $\alpha_i = C$, it follows that
$\|\mathbf{e}_i\|_p - \xi_i - \varepsilon = 0$ according to (11.III) which implies
$\|\mathbf{e}_i\|_p = \varepsilon + \xi_i \ge \varepsilon$.
Hence, it is clear that $\|\Gamma_i\|_q = C \Rightarrow \|\mathbf{e}_i\|_p \ge \varepsilon$.
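The three cases above amount to a partition of the training indices by the dual norm of each multiplier. A minimal sketch of that bookkeeping follows; the function name `classify_by_kkt` and the numerical tolerance `tol` are assumptions introduced for this illustration, since a numerical solver rarely returns exact zeros or exact bounds.

```python
import math

def q_norm(v, q):
    """q-norm of a sequence; q = math.inf selects the max norm."""
    if q == math.inf:
        return max(abs(t) for t in v)
    return sum(abs(t) ** q for t in v) ** (1.0 / q)

def classify_by_kkt(Gamma, C, q, tol=1e-8):
    """Split indices into (vanishing, margin, at-bound) sets.

    vanishing: ||Gamma_i||_q = 0      -> error inside the eps-zone
    margin:    0 < ||Gamma_i||_q < C  -> error norm equal to eps
    at-bound:  ||Gamma_i||_q = C      -> error norm at or beyond eps
    """
    zero, margin, bound = [], [], []
    for i, g in enumerate(Gamma):
        a = q_norm(g, q)
        if a <= tol:
            zero.append(i)
        elif a >= C - tol:
            bound.append(i)
        else:
            margin.append(i)
    return zero, margin, bound

if __name__ == "__main__":
    Gamma = [[0.0, 0.0], [3.0, 4.0], [0.6, 0.8]]
    print(classify_by_kkt(Gamma, C=5.0, q=2))
```

The margin set is the one used to recover the bias, and the union of the margin and at-bound sets gives the support vectors.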


D.  Determining  the  Bias 


The VV-SVM optimization problem is solved in the dual
space of Lagrange multipliers, $\{\Gamma_i\}_1^\ell$, leaving the bias, $\mathbf{b}$,
from the primal problem (10) yet to be determined. Just like
the scalar-valued SVM, $\mathbf{b}$ is completely determined by $\{\Gamma_i\}_1^\ell$
based on the KKT conditions just derived. Let the support
vectors be those input vectors $\mathbf{x}_i$ for which $\|\Gamma_i\|_q \ne 0$, then
for each support vector which is on the margin ($\|\Gamma_i\|_q \in
(0, C)$) we know that the magnitude of the error is given
by $\|\mathbf{e}_i\|_p = \varepsilon$ and that the direction is given by (20). Let
the biased error be $E_i = \mathbf{e}_i + \mathbf{b} = \mathbf{y}_i - \sum_{j=1}^{\ell} \Gamma_j k(\mathbf{x}_j, \mathbf{x}_i)$
and the signature be $\sigma_i = \operatorname{sign}(\Gamma_i)$ where $\operatorname{sign}(\cdot)$ is taken
element-wise and $\operatorname{sign}(0) = 0$, then for all $i \in \mathcal{M} = \{i :
\|\Gamma_i\|_q \in (0, C)\}$ and $1 < p < \infty$ the KKT conditions require
that

$$\mathbf{b} = E_i - \varepsilon\, \sigma_i \circ \left(\frac{|\Gamma_i|}{\|\Gamma_i\|_q}\right)^{q-1} \tag{21}$$

which allows the bias to be calculated from any element in
$\mathcal{M}$. Note that this method will not work for $p = 1$ or $p =
\infty$ because (20) does not fully convey all of the necessary
direction information to properly assess the bias. In these
cases, one may have to use up to m points from $\mathcal{M}$ with
linearly independent signatures to determine the bias.
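For 1 < p < ∞ the bias follows from a single margin point: the error on the margin has p-norm ε and its direction is recovered from the multiplier through the conjugate exponent q = p/(p-1). The sketch below implements that relation under this reading of (21); the function name `bias_from_margin_point` and the argument names are hypothetical.

```python
import math

def bias_from_margin_point(E_i, Gamma_i, eps, p):
    """Bias from one margin point, for 1 < p < inf.

    E_i     : biased error y_i - sum_j Gamma_j k(x_j, x_i), length m
    Gamma_i : multiplier of the same point, 0 < ||Gamma_i||_q < C
    The error on the margin is eps * sign(Gamma) * (|Gamma|/||Gamma||_q)^(q-1).
    """
    q = p / (p - 1.0)
    nq = sum(abs(g) ** q for g in Gamma_i) ** (1.0 / q)
    return [E - eps * math.copysign((abs(g) / nq) ** (q - 1), g)
            for E, g in zip(E_i, Gamma_i)]

if __name__ == "__main__":
    p, eps = 3.0, 0.1
    b_true = [0.2, -0.4]
    # Build an error with ||e||_p = eps, then a multiplier along the
    # gradient of the p-norm at e (any positive scale below C would do).
    raw = [0.3, -0.4]
    n = sum(abs(v) ** p for v in raw) ** (1.0 / p)
    e = [eps * v / n for v in raw]
    Gamma = [0.7 * math.copysign((abs(v) / eps) ** (p - 1), v) for v in e]
    E = [a + b for a, b in zip(e, b_true)]
    print(bias_from_margin_point(E, Gamma, eps, p))  # close to b_true
```

In practice one might average the result over all margin points to reduce the effect of solver tolerance; that averaging is a common convention rather than something stated in the text.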


IV.  Specific  Formulations  for  Common  Norms 

The results presented thus far have been derived for the
general case $1 \le p \le \infty$ which is primarily of theoretical
interest. For practical computation, values other
than $p = 1, 2, \infty$ are of less value due to their complexity.
The cases of $p = 1$ and $p = \infty$ are appealing because they
result in linear math. The case of $p = 2$ is appealing because
it is Euclidean, results in a symmetry between the primal and
dual spaces (since $q = 2$) and is mathematically tractable. In
this section each of these three cases is studied with regard
to a solution strategy.


A.  1-Norm 

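In this case we have $p = 1$, therefore $q = \infty$ in (19), and
re-introducing $\alpha_i$, the dual problem becomes

$$\max_{\{\Gamma_i,\, \alpha_i\}}:\; D = -\frac{1}{2}\sum_{i,j=1}^{\ell} \Gamma_i^T \Gamma_j k(\mathbf{x}_i, \mathbf{x}_j) + \sum_{i=1}^{\ell} \mathbf{y}_i^T \Gamma_i - \varepsilon \sum_{i=1}^{\ell} \alpha_i$$

$$\text{S.T.:}\quad \sum_{i=1}^{\ell} \Gamma_i = 0, \qquad -\alpha_i \mathbf{1} \le \Gamma_i \le \alpha_i \mathbf{1}, \qquad 0 \le \alpha_i \le C$$

which is quadratic in its objective and linear in its constraints.
It can be solved with standard quadratic programming
software. For each support vector which is on the margin ($i \in
\mathcal{M}$) the KKT conditions indicate that $\|\mathbf{e}_i\|_1 = \varepsilon$. This
information must be exploited to find the bias $\mathbf{b}$ because
(21) cannot be used for $p = 1$.

To determine the bias, m marginal support vectors must
be found which have linearly independent signatures, $\sigma_i$. So
for $i \in \mathcal{M}$, $\|\mathbf{e}_i\|_1 = \varepsilon$ may be computed as $\sigma_i^T \mathbf{e}_i = \varepsilon$, hence
it follows that

$$\sigma_i^T \mathbf{b} = \sigma_i^T E_i - \varepsilon$$
$$\sigma_i^T \mathbf{b} = E_i^\sigma - \varepsilon$$

where $E_i = \mathbf{e}_i + \mathbf{b}$ is the biased error and $E_i^\sigma = \sigma_i^T E_i$.
Amassing all k samples in $\mathcal{M}$, a consistent system of
over-determined equations is obtained as

$$\mathcal{S}\mathbf{b} = E^\sigma - \varepsilon \mathbf{1}$$

where $\mathcal{S} = [\sigma_1 \cdots \sigma_k]^T$ and $E^\sigma = [E_1^\sigma \cdots E_k^\sigma]^T$
are introduced. This system is easily solved by one of two
methods. Since the system of equations is consistent, m
independent rows may be extracted from $\mathcal{S}$ and the equation
solved directly, or the entire matrix $\mathcal{S}$ may be inverted using
the Moore-Penrose pseudo-inverse

$$\mathbf{b} = \mathcal{S}^+ \big(E^\sigma - \varepsilon \mathbf{1}\big)$$

where $\mathcal{S}^+ = (\mathcal{S}^T \mathcal{S})^{-1} \mathcal{S}^T$.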
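The pseudo-inverse solution for the 1-norm bias can be sketched without external libraries by forming the normal equations of the over-determined system and solving the resulting m-by-m system. The names `solve_linear` and `bias_one_norm` are hypothetical; each signature is assumed to be a list with entries in {-1, 0, 1} and each biased error a list of the same length.

```python
def solve_linear(A, rhs):
    """Gauss-Jordan elimination with partial pivoting (small square system)."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[piv][c]) < 1e-12:
            raise ValueError("signatures are not linearly independent")
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def bias_one_norm(sigmas, E, eps):
    """Least-squares solution of S b = E^sigma - eps*1 via normal equations.

    sigmas : k signatures (rows of S), each of length m
    E      : k biased errors, each of length m
    """
    m = len(sigmas[0])
    # Right-hand side entries sigma_i^T E_i - eps, one per margin point.
    rhs = [sum(s * e for s, e in zip(sig, E_i)) - eps
           for sig, E_i in zip(sigmas, E)]
    StS = [[sum(sig[a] * sig[c] for sig in sigmas) for c in range(m)]
           for a in range(m)]
    Str = [sum(sig[a] * r for sig, r in zip(sigmas, rhs)) for a in range(m)]
    return solve_linear(StS, Str)

if __name__ == "__main__":
    b_true, eps = [0.3, -0.2], 0.1
    errors = [[0.06, 0.04], [0.07, -0.03], [-0.05, -0.05]]  # ||e||_1 = eps
    sigmas = [[1, 1], [1, -1], [-1, -1]]
    E = [[e + b for e, b in zip(err, b_true)] for err in errors]
    print(bias_one_norm(sigmas, E, eps))  # close to b_true
```

Solving the normal equations is equivalent to applying the pseudo-inverse when the signature matrix has full column rank, which is the linear-independence requirement stated in the text.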

B.  2-Norm 

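In this case $p = 2$, therefore $q = 2$. By re-introducing $\alpha_i$,
the dual problem becomes

$$\max_{\{\Gamma_i,\, \alpha_i\}}:\; D = -\frac{1}{2}\sum_{i,j=1}^{\ell} \Gamma_i^T \Gamma_j k(\mathbf{x}_i, \mathbf{x}_j) + \sum_{i=1}^{\ell} \mathbf{y}_i^T \Gamma_i - \varepsilon \sum_{i=1}^{\ell} \alpha_i$$

$$\text{S.T.:}\quad \sum_{i=1}^{\ell} \Gamma_i = 0, \qquad \Gamma_i^T \Gamma_i \le \alpha_i^2, \qquad 0 \le \alpha_i \le C$$

which is quadratic in its objective but nonlinear in its
constraints. It must be solved using general nonlinear
programming software. Fortunately, the objective and the constraints
are smooth, so gradient information is available to be used
in the optimization process. For each support vector which is
on the margin, the KKT conditions indicate that $\|\mathbf{e}_i\|_2 = \varepsilon$.
This information may be exploited to find the bias $\mathbf{b}$, and
the use of Equation (21) is permitted. So in this case the
following holds

$$\mathbf{b} = E_i - \varepsilon \frac{\Gamma_i}{\|\Gamma_i\|_2} \qquad \forall\, i \in \mathcal{M}$$

where $E_i = \mathbf{e}_i + \mathbf{b}$ is the biased error and the signature $\sigma_i$
from (21) is not needed because $q - 1 = 1$ is whole and odd,
which preserves the sign information in $\Gamma_i$.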

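C.  ∞-Norm

In this case $p = \infty$, therefore $q = 1$. By re-introducing
$\gamma_i$ and $\gamma_i^*$, the dual problem becomes

$$\max_{\{\gamma_i,\, \gamma_i^*\}}:\; D = -\frac{1}{2}\sum_{i,j=1}^{\ell} (\gamma_i - \gamma_i^*)^T (\gamma_j - \gamma_j^*) k(\mathbf{x}_i, \mathbf{x}_j) + \sum_{i=1}^{\ell} \Big( \mathbf{y}_i^T (\gamma_i - \gamma_i^*) - \varepsilon \mathbf{1}^T (\gamma_i + \gamma_i^*) \Big)$$

$$\text{S.T.:}\quad \sum_{i=1}^{\ell} (\gamma_i - \gamma_i^*) = 0, \qquad \mathbf{1}^T (\gamma_i + \gamma_i^*) \le C, \qquad \gamma_i \ge 0, \quad \gamma_i^* \ge 0$$

which is quadratic in its objective and linear in its constraints.
It can be solved with standard quadratic programming
software. For each support vector which is on the margin, we
know that $\|\mathbf{e}_i\|_\infty = \varepsilon$. This information must be exploited
to find the bias $\mathbf{b}$ because (21) cannot be used for $p = \infty$.

Due to the nature of the optimization problem, $\Gamma_i$ will
typically be sparse because one component will be more
effective at increasing the objective function than the others.
It is also seen from (20) that $\frac{|\Gamma_i|}{\|\Gamma_i\|_1} = \lim_{p \to \infty} \left(\frac{|\mathbf{e}_i|}{\|\mathbf{e}_i\|_p}\right)^{p-1}$,
so for any $|e_{i,j}| \ne \|\mathbf{e}_i\|_\infty$ it follows that $\Gamma_{i,j} = 0$, but not
vice versa. So for any element $\Gamma_{i,j} = 0$ no conclusion may be
drawn with regard to $e_{i,j}$ other than $|e_{i,j}| \le \|\mathbf{e}_i\|_\infty$. Then to
determine the bias, at most m support vectors must be found
which have linearly independent signatures. So for $i \in \mathcal{M}$,
$\|\mathbf{e}_i\|_\infty = \varepsilon$ is equivalent to $\sigma_i \circ \mathbf{e}_i = \varepsilon|\sigma_i|$, hence it follows
that

$$\sigma_i \circ \mathbf{b} = \sigma_i \circ E_i - \varepsilon|\sigma_i|$$
$$\operatorname{diag}(\sigma_i)\,\mathbf{b} = E_i^\sigma - \varepsilon|\sigma_i|$$

where $E_i = \mathbf{e}_i + \mathbf{b}$ is the biased error and $E_i^\sigma = \sigma_i \circ E_i$.
Amassing all k samples in $\mathcal{M}$, a consistent system of
over-determined equations is obtained as

$$\begin{bmatrix} \operatorname{diag}(\sigma_1) \\ \vdots \\ \operatorname{diag}(\sigma_k) \end{bmatrix} \mathbf{b} = \begin{bmatrix} E_1^\sigma - \varepsilon|\sigma_1| \\ \vdots \\ E_k^\sigma - \varepsilon|\sigma_k| \end{bmatrix}$$

$$\mathcal{S}\mathbf{b} = E^\sigma - \varepsilon|\sigma|$$

where $\mathcal{S} = [\operatorname{diag}(\sigma_1) \cdots \operatorname{diag}(\sigma_k)]^T$,
$E^\sigma = [E_1^{\sigma T} \cdots E_k^{\sigma T}]^T$ and $\sigma = [\sigma_1^T \cdots \sigma_k^T]^T$ are
introduced. This system is easily solved by one of two
methods. Since the system of equations is consistent, m
independent (and non-zero) rows may be extracted from $\mathcal{S}$
and the equation solved directly, or the entire matrix $\mathcal{S}$ may
be inverted using the Moore-Penrose pseudoinverse yielding
$\mathbf{b} = \mathcal{S}^+ \big(E^\sigma - \varepsilon|\sigma|\big)$.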
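Because the stacked matrix in the ∞-norm case is built from diagonal blocks, its normal equations are diagonal, and the pseudo-inverse solution reduces to a per-component average over the margin points whose signature is non-zero in that component. The sketch below uses this simplification, which is an observation added here rather than a step stated in the text; the name `bias_inf_norm` is hypothetical and signature entries are assumed to lie in {-1, 0, 1}.

```python
def bias_inf_norm(sigmas, E, eps):
    """Least-squares bias for p = inf from margin points.

    sigmas : k signatures, entries in {-1, 0, 1}, each of length m
    E      : k biased errors y_i - sum_j Gamma_j k(x_j, x_i), each of length m
    Each row sigma_ij * b_j = sigma_ij * E_ij - eps * |sigma_ij| with
    sigma_ij != 0 gives b_j = E_ij - eps * sigma_ij; rows with
    sigma_ij == 0 carry no information about b_j.
    """
    m = len(sigmas[0])
    b = []
    for j in range(m):
        terms = [E_i[j] - eps * s[j] for s, E_i in zip(sigmas, E) if s[j] != 0]
        if not terms:
            raise ValueError("no margin point determines component %d" % j)
        b.append(sum(terms) / len(terms))
    return b

if __name__ == "__main__":
    b_true, eps = [0.3, -0.2], 0.1
    errors = [[0.1, 0.02], [-0.05, -0.1]]   # ||e||_inf = eps for both
    sigmas = [[1, 0], [0, -1]]              # active components only
    E = [[e + b for e, b in zip(err, b_true)] for err in errors]
    print(bias_inf_norm(sigmas, E, eps))    # close to b_true
```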

V.  Experimental  Demonstrations 

The first of two examples concerns learning a mapping
$f : \mathbb{R} \mapsto \mathbb{R}^2$ given by $\mathbf{y}(x) = [\operatorname{sinc}(x) \;\; \cos(0.1x^2)]^T$
which is suitable for visualization. We choose 50 equally
spaced samples on $x \in [0, 10]$ as the input data, an RBF
kernel with $\gamma = \frac{1}{2}$, $C = 100$ and $\varepsilon = 0.1$. The results of
the VV-SVR training with $p = 1$, $p = 2$ and $p = \infty$ are
shown in Figures 1, 2 and 3 respectively. Each of these
figures contains four plots illustrating the solution. For the
three different norms, there were 16, 16 and 19 support
vectors found respectively.
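Once the multipliers and the bias are available, evaluating a vector-valued estimator of this kind needs only one kernel evaluation per support vector, shared by all m outputs. The sketch below illustrates this with an RBF kernel. The parameterization exp(-gamma * ||x - z||^2) is an assumption, since the text does not spell out the kernel formula, and the names `rbf_kernel` and `vv_svr_predict` are hypothetical.

```python
import math

def rbf_kernel(x, z, gamma):
    """Assumed RBF form: k(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def vv_svr_predict(x, support_x, support_gamma, b, gamma=0.5):
    """Evaluate y_hat(x) = sum_i Gamma_i k(x_i, x) + b.

    support_x     : support vectors x_i (lists of length n)
    support_gamma : multipliers Gamma_i (lists of length m)
    b             : bias (list of length m)
    """
    y_hat = list(b)
    for x_i, g_i in zip(support_x, support_gamma):
        k = rbf_kernel(x_i, x, gamma)  # one evaluation for all m outputs
        for j in range(len(y_hat)):
            y_hat[j] += g_i[j] * k
    return y_hat

if __name__ == "__main__":
    sv_x = [[0.0], [2.0]]
    sv_g = [[1.0, -2.0], [-1.0, 2.0]]
    print(vv_svr_predict([0.5], sv_x, sv_g, b=[0.5, 0.5]))
```

An aggregate of m scalar estimators with disjoint support vectors would instead need a kernel evaluation for every support vector of every task.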

The second example is a fit of the Hwang data set [11]
(which is available at the Delve database [12]) which consists
of a function $H : [0, 1]^2 \mapsto \mathbb{R}^5$. Our intention here is
to demonstrate the sparsity of the VV-SVR approach as
compared to the aggregated SV-SVR approach. In this case
we use a sample size of $\ell = 2{,}000$. For the VV-SVR
we choose $\varepsilon = 0.5$ and $p = q = 2$. To obtain a fair
comparison we choose a compatible value of $\varepsilon$ for the
SV-SVR by assuring that the hyper-volume of the hyper-cube
$\{\mathbf{e} : -\varepsilon \mathbf{1} \le \mathbf{e} \le \varepsilon \mathbf{1}\}$ be the same as the hyper-volume of
the ball $\{\mathbf{e} : \|\mathbf{e}\|_2 \le 0.5\}$. We therefore choose $\varepsilon = 0.3485$
for the scalar-valued case. Upon performing the calculations
it was found that the VV-SVR method is indeed sparser in
support vectors than the aggregated SV-SVRs (which used
LIBSVM [13]). Of the 2,000 training points, the two methods
together discovered a total of 124 unique support vectors:
55 for the VV-SVR method and 6, 20, 28, 29 and 25
for $H_1(\mathbf{x})$ through $H_5(\mathbf{x})$ respectively, for 92 unique values
for the aggregated SVR method. In both cases we chose an
RBF kernel with $\gamma = 8$ and $C = 100$. We observe that each
SV-SVR is individually sparser than the VV-SVR; however, in
aggregate, they are less sparse than the VV-SVR method. The
sparseness of the VV-SVR is attributable to the third term
on the right hand side of (19). This term adds a cusp to the
objective function which "traps" some $\Gamma_i$ at 0, thus resulting
in aggregate sparsity. For estimators with high-dimensional
input spaces, the kernel evaluation becomes significant in
the computation of the estimate; it is therefore desirable to
obtain the sparsest solution in terms of support vectors for
efficiency of evaluation. It is in this regard that VV-SVR has
an advantage over the aggregated SV-SVR approach. These
sparsity results are illustrated in Figure 4, which shows
the VV-SVR support vectors (left), the scalar-valued SVM
support vectors (center) and the aggregated scalar-valued
SVM support vectors (right).
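The matched insensitive-zone width used in this comparison can be reproduced by equating the volume of an m-dimensional ball with that of a hyper-cube. The sketch below uses the standard closed form for the ball volume; the function names `ball_volume` and `matched_cube_half_width` are hypothetical.

```python
import math

def ball_volume(m, r):
    """Volume of the m-dimensional Euclidean ball of radius r."""
    return math.pi ** (m / 2.0) / math.gamma(m / 2.0 + 1.0) * r ** m

def matched_cube_half_width(m, r):
    """Half-width eps of the hyper-cube [-eps, eps]^m whose volume
    (2 * eps)^m equals that of the m-ball of radius r."""
    return 0.5 * ball_volume(m, r) ** (1.0 / m)

if __name__ == "__main__":
    # Five outputs and a 2-norm zone of radius 0.5, as in the Hwang example.
    print(round(matched_cube_half_width(5, 0.5), 4))
```

For m = 5 and radius 0.5 this evaluates to approximately 0.3485, in agreement with the value quoted in the text.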


Fig. 1. 1-norm VV-SVR approximation of $\mathbf{y}(x)$. (a) Original function $y_i(x)$ and estimate $\hat{y}_i(x)$. (b) Errors $\mathbf{e}_i$ and the ball $\|\mathbf{e}\|_1 = \varepsilon$. (c) Lagrange multipliers
$\Gamma_i$ vs. $x$. (d) Lagrange multipliers.


Fig. 2. 2-norm VV-SVR approximation of $\mathbf{y}(x)$. (a) Original function $y_i(x)$ and estimate $\hat{y}_i(x)$. (b) Errors $\mathbf{e}_i$ and the sphere $\|\mathbf{e}\|_2 = \varepsilon$. (c) Lagrange multipliers
$\Gamma_i$ vs. $x$. (d) Lagrange multipliers.


Fig. 3. ∞-norm VV-SVR approximation of $\mathbf{y}(x)$. (a) Original function $y_i(x)$ and estimate $\hat{y}_i(x)$. (b) Errors $\mathbf{e}_i$ and the ball $\|\mathbf{e}\|_\infty = \varepsilon$. (c) Lagrange multipliers
$\Gamma_i$ vs. $x$. (d) Lagrange multipliers.


VI.  Observations  and  Conclusions 


First we observe that VV-SVR is an extension of the SV-SVR
in that they are equivalent when $m = 1$. Table I shows
a comparison of the two methods. Secondly we observe that
the aggregated SV-SVR approach is equivalent to the VV-SVR
if in (8) we let $L(\mathbf{e}) = \big\| |\mathbf{e}|_\varepsilon \big\|_1$. Thirdly we observe
that dual variables which are at bound (i.e. $\|\Gamma_i\|_q = C$)
retain $m - 1$ degrees of freedom. Finally, we conclude that the
advantages of VV-SVR proceed from the fact that it results
in sparser solutions and thus more efficient implementations.

Fig. 4. Sparsity of VV-SVR vs. SV-SVR. Shown are the 124 unique support
vectors found. The rows represent unique indices i and the columns indicate
the outputs $H_1(\mathbf{x})$ through $H_5(\mathbf{x})$ from left to right. A black cell indicates
a support vector. (a) The 55 support vectors for the VV-SVR method. (b)
Plot of the support vectors found for each SV-SVR. (c) The 92 support
vectors for the aggregated SV-SVR method.


